Group Presentation

Survey Methodology II

Irantzu, Sophie, Diego, Gür

2025-03-19

Data cleaning

  • Libraries and cleaning Data

  • Creating a data frame with column names and labels

  • Convert into tidy Data

data <- data |> 
  mutate(
    qc19 = factor(qc19, levels = c(1, 2, 3), labels = c("Yes", "No", "DK")),
    d10 = case_when(
      d10 == 1 ~ 1,  # Man for best practice
      d10 == 2 ~ 0   # Woman
    ),
    d10 = factor(d10, levels = c(0, 1),
                 labels = c("Woman", "Man"))
  )

Descriptive Analysis

  • General descriptive analysis

  • Legal framework

  • Historical background

  • Cultural and Societal norms

  • Economical analysis

General descriptive analysis

  • Check for NA’s

  • Total observations 27438

  • Total countries 28

  • Target variable –> qc19: Opinion on whether transgender people should be able to change official documents

LGBT Index by Country

Dividing Europe in West/North/South and East

LGTB Index and Region

lm_model_interact <- lm(LGBT..Policy.Index ~ qc19 * region + qc20 * region, data = data)

summary(lm_model_interact)
  • North & South Europe: disagreement with qc19 is strongly associated with lower LGBT rights

  • Western Europe: the approval or disapproval of qc19 is not as important for the LGBT Policy Index -> other factors are more important there

Historical blocs

Generations

Religion

Political Ideology - self reports

Societal attitudes

  • Correlations between discrimination perceptions and demographic variables

Economic variables

  • Economic questions from the survey: internac_trade = qa1, trade_tariffs = qa12, trade_agreements = qa13, online_purchases = qa14, occupation = d15a, urban_rural = d25 or social_class = d63.

  • 3 new external variables per country:

    /- Actual individual consumption (AIC): value of products actually consumed by individuals.

    /- Unemployment rate

    /- Minimum wage

Multiple Imputation with MICE

  • Na´s in the min_wage variable

  • Most similar distribution is the lasso regression

Graph: International Trade Benefit.

International trade benefit -> - 1: yes, benefitting a lot - 2: yes, benefitting somewhat - 3: No, not really benefitting 4: No, not benefitting at all 5: DK.

Feature engineering + Multicollinearity

Model

  • Step: AIC=8607.78 qc19_numeric ~ country_name + qc20 + qc15_1 + qc15_2 + qc15_3 + qc17_3 + qc17_4 + qc18_2 + qc18_3 + qc1_4 + qc4_7 + qc4_8 + qc5_3 + d11 + d15a + d15b + d60 + qa1 + qa14 + d10
  • Scaled numerical values and removed non-significant variables
  • Checked for NA’s

Imputation

  • Replace NA’s in target variable with median
  • Apply MICE for the other variables
Class: mids
Number of multiple imputations:  5 
Imputation methods:
      d15b       d15a d11_scaled d60_scaled       d1r2 
     "pmm"         ""         ""         ""         "" 
PredictorMatrix:
           d15b d15a d11_scaled d60_scaled d1r2
d15b          0    1          1          1    1
d15a          1    0          1          1    1
d11_scaled    1    1          0          1    1
d60_scaled    1    1          1          0    1
d1r2          1    1          1          1    0

GLMER Model

glmer_model_alt <- glmer(qc19_numeric ~ d10 + d11_scaled + I(d11_scaled^2) + qc20 + 
                        qc15_1 + qc15_2 + qc15_3 + qc17_3 + qc17_4 + 
                        qc1_4 + qc4_7 + d15b + qa1 + (1 | country_name), 
                      data = data_selected, family = binomial,
                      control = glmerControl(optimizer = "nloptwrap", 
                                             optCtrl = list(maxfun = 300000)))
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: qc19_numeric ~ d10 + d11_scaled + I(d11_scaled^2) + qc20 + qc15_1 +  
    qc15_2 + qc15_3 + qc17_3 + qc17_4 + qc1_4 + qc4_7 + d15b +  
    qa1 + (1 | country_name)
   Data: data_selected
Control: glmerControl(optimizer = "nloptwrap", optCtrl = list(maxfun = 3e+05))

     AIC      BIC   logLik deviance df.resid 
   922.1    995.7   -446.1    892.1      985 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.1465 -0.5057  0.2908  0.4889  6.8873 

Random effects:
 Groups       Name        Variance Std.Dev.
 country_name (Intercept) 0        0       
Number of obs: 1000, groups:  country_name, 12

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)      5.64126    0.55475  10.169  < 2e-16 ***
d10Man          -0.50964    0.16921  -3.012  0.00260 ** 
d11_scaled       0.56090    0.09102   6.163 7.15e-10 ***
I(d11_scaled^2) -0.08090    0.07619  -1.062  0.28837    
qc20            -1.25187    0.15513  -8.070 7.05e-16 ***
qc15_1          -0.28206    0.13809  -2.043  0.04110 *  
qc15_2          -0.45966    0.15483  -2.969  0.00299 ** 
qc15_3          -0.22142    0.14965  -1.480  0.13897    
qc17_3           0.02163    0.15354   0.141  0.88799    
qc17_4          -0.17544    0.13559  -1.294  0.19569    
qc1_4           -0.06158    0.09897  -0.622  0.53380    
qc4_7            0.44452    0.22073   2.014  0.04403 *  
d15b             0.03182    0.02689   1.183  0.23668    
qa1             -0.12665    0.09702  -1.305  0.19177    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

  • Multicollinearity revised with VIF –> <2 no problem
  • Model precision:
Metric Value
Accuracy 0.7940
Precision 0.8168
Recall 0.9228
F1-Score 0.8666

ROC / AUC

Random Forest

rf_model <- randomForest(qc19_numeric ~ ., data = data_selected, ntree = 500, importance = TRUE)

  • Training / testing (80-20)
Model Accuracy Precision Recall F1.Score
Random Forest 0.320 1.0000 0.0073 0.0145
Our Model 0.794 0.8168 0.9228 0.8666

Additional model with 2 more variables

glmer_model_alt2.0 <- glmer(qc19_numeric ~ d10 + d11_scaled + I(d11_scaled^2) + qc20 + 
                        qc15_1 + qc15_2 + qc15_3 + qc17_3 + qc17_4 + sd3 + d1r2 +
                        qc1_4 + qc4_7 + d15b + qa1 + (1 | country_name), 
                      data = data_selected, family = binomial,
                      control = glmerControl(optimizer = "nloptwrap", 
                                             optCtrl = list(maxfun = 300000)))
Metric Value
Accuracy 0.7880
Precision 0.8194
Recall 0.9076
F1-Score 0.8613